| Author | |
|---|---|
| Name | Claire Descombes |
| Affiliation | Universitätsklinik für Neurochirurgie, Inselspital Bern |
| Degree | MSc Statistics and Data Science, University of Bern |
| Contact | claire.descombes@insel.ch |
The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.
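When you read a file from somewhere on your computer (e.g. a dataset), you can either put it in the folder that is set as your working directory and refer to it by its file name, or always give the full path to the file.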
The advantage of keeping the files in the same folder as your script, and setting that folder as the working directory, is that you can move the folder around on your computer without breaking your script: just set the working directory to the location of your source file every time you open it, and you’ll be fine.
# Example
setwd("~/path/to/your/folder/")
data <- read.csv("testdata.csv")
The advantage of always giving the full path to a file is that you can read data from different folders on your computer, without having to copy the source data into every folder that contains a corresponding script.
# Example
data <- read.csv("~/path/to/your/folder/testdata.csv")
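The two approaches can be compared side by side. The following is a minimal sketch, not part of the course material: it assumes a temporary folder (`project_dir`) standing in for your project folder and writes a small made-up file, `testdata.csv`, so that it runs on any computer.

```r
# Temporary folder that stands in for your project folder (assumption)
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Write a tiny made-up file so that the sketch is self-contained
write.csv(data.frame(id = 1:3, age = c(34, 51, 67)),
          file.path(project_dir, "testdata.csv"), row.names = FALSE)

# Option 1: set the working directory, then use the file name only
old_wd <- setwd(project_dir)
data_relative <- read.csv("testdata.csv")
setwd(old_wd) # Go back to the previous working directory

# Option 2: give the full path, independent of the working directory
data_full <- read.csv(file.path(project_dir, "testdata.csv"))

# Check that both options read the same data
identical(data_relative, data_full)
```

`file.path()` joins folder and file names with the correct separator, which avoids typing errors in long paths.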
Working directory
To tell R which folder you are working in (e.g., where your data is stored), you have several options, such as calling setwd("path/to/your/folder") in your script.
💡 Tip: To avoid file path errors and keep your project organized,
it’s best to store your script and data files in the same folder, or at
least to place your data files in a subfolder like data_sets
within your project directory. Then, set that folder as your working
directory in R. This ensures that your code can reliably find and save
your files.
getwd() # Displays the current working directory
setwd("~/path/to/your/folder") # Sets the working directory
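Before reading any data, it can help to check what R actually sees in the working directory. The sketch below is one way to do this; it uses `tempdir()` as a stand-in for your project folder (an assumption for illustration) and creates an empty `data_sets` subfolder there.

```r
# tempdir() stands in for your project folder (assumption for this sketch)
previous_wd <- setwd(tempdir()) # setwd() invisibly returns the previous working directory

# Create a data_sets subfolder inside the working directory
dir.create("data_sets", showWarnings = FALSE)

getwd()                 # Displays the folder R is currently working in
list.files()            # Lists what R can see in that folder
dir.exists("data_sets") # Relative paths are resolved from the working directory

setwd(previous_wd) # Restore the original working directory
```

If `list.files()` does not show the file or folder you expect, the working directory is probably not the one you intended.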
We will first look at how to import a CSV file into R as a data frame.
CSV stands for Comma-Separated Values. In a .csv file,
the values are stored as plain text, separated by commas. This is a
simple and widely used format for storing tabular data.
After setting your working directory or determining the path to your
CSV file, you can use the read.csv() function to import the
data. This will create a data frame, which is one of the most commonly
used structures in R for handling datasets.
# Import a CSV file into a data frame
dataset <- read.csv("~/path/to/your/folder/data.csv")
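After importing, it is worth checking that the result looks as expected. The sketch below assumes a small made-up CSV file written to a temporary location (`csv_path` is a hypothetical name) and inspects the imported data frame.

```r
# Write a small made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age",
             "1,Female,34",
             "2,Male,51",
             "3,Female,"), csv_path)

# Import the file into a data frame
dataset <- read.csv(csv_path)

class(dataset) # Confirms that the object is a data frame
str(dataset)   # Column names, column types and first values
dim(dataset)   # Number of rows and columns

# The empty age field in the last row is imported as a missing value (NA)
is.na(dataset$age)
```

If your file uses semicolons as separators and commas as decimal marks, the base R function read.csv2() is the counterpart of read.csv() for that layout.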
💡 I recommend using data frames — they are generally easier to work with than matrices, especially for beginners.
Another widely used data format is the Excel file
(.xlsx). For these, you can use the readxl
package to import the data:
# Load the readxl package (after installing it)
library(readxl)
# Read the first sheet of an Excel file
dataset <- read_excel("~/path/to/your/folder/data.xlsx")
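Excel workbooks can contain several sheets. The sketch below shows one way to list them and to pick one; it assumes that the readxl package is installed and uses the example workbook that ships with it (via `readxl_example()`), so no file of your own is needed.

```r
# Run only if readxl is installed: install.packages("readxl")
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook shipped with the readxl package
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # List the sheet names of the workbook
  sheets <- readxl::excel_sheets(xlsx_path)
  print(sheets)

  # Read one sheet by position (sheet = 1) or by name (sheet = sheets[1])
  dataset <- readxl::read_excel(xlsx_path, sheet = 1)

  # read_excel() returns a tibble; convert it if you prefer a plain data frame
  dataset <- as.data.frame(dataset)
  head(dataset)
}
```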
⚠️ Note: If your file is actually a CSV but mistakenly has a .xlsx extension, you should rename it to .csv and use read.csv() instead. Mixing up formats can lead to import errors.
Let us now look at real data frames to learn how to access or modify
their elements. To do this, we will use several health data sets from
the 2011-2012 National Health and Nutrition Examination Survey (NHANES).
The survey, conducted by the National Center for Health Statistics
(NCHS), assessed the overall health and nutrition of adults and
children in the United States. The data sets can be found
in the data_sets
folder.
| Dataset | NHANES Code | Description | CSV File |
|---|---|---|---|
| Demographics | DEMO_G | Age, sex, race/ethnicity, income, education | DEMO_G.csv |
| Blood Pressure | BPX_G | Systolic/diastolic blood pressure, number of readings | BPX_G.csv |
| Body Measures | BMX_G | Height, weight, BMI, waist circumference | BMX_G.csv |
| Smoking Questionnaire | SMQ_G | Smoking habits, exposure to secondhand smoke | SMQ_G.csv |
# Load the necessary CSV files into data frames
demo <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/DEMO_G.csv") # Demographics (cycle G = 2011–2012)
bpx <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/BPX_G.csv") # Blood pressure
bmx <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/BMX_G.csv") # Body measures
smq <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/SMQ_G.csv") # Smoking questionnaire
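The full paths above are specific to one computer. One way to make the import portable is to store the folder once and build each path from it. The sketch below is an illustration under stated assumptions: `data_dir` is a hypothetical variable pointing to a temporary stand-in folder, and the files it reads are tiny placeholders, not the real NHANES data.

```r
# Stand-in for your data_sets folder (assumption); replace with your own path
data_dir <- file.path(tempdir(), "data_sets")
dir.create(data_dir, showWarnings = FALSE)

# NHANES table codes, named after the data frames we want to create
tables <- c(demo = "DEMO_G", bpx = "BPX_G", bmx = "BMX_G", smq = "SMQ_G")

# Create tiny placeholder files so that the sketch runs without the real data
for (code in tables) {
  write.csv(data.frame(SEQN = 1:2),
            file.path(data_dir, paste0(code, ".csv")), row.names = FALSE)
}

# Build each path from the folder and the file name, then read the file
demo <- read.csv(file.path(data_dir, "DEMO_G.csv"))
bpx  <- read.csv(file.path(data_dir, "BPX_G.csv"))

# Alternative: read all files in one step into a named list of data frames
nhanes <- lapply(tables, function(code) {
  read.csv(file.path(data_dir, paste0(code, ".csv")))
})
names(nhanes) # The list elements carry the names given in `tables`
```

With this layout, only `data_dir` needs to change when the project folder is moved to another computer.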
✏️ Exercise on the NHANES data sets n°1: import the
demo, bpx, bmx and
smq data sets from the data_sets
folder into R.
💡 The codebook for each data set can be accessed either on the NCHS website
or directly in R using the function
nhanesCodebook(nh_table, colname) from the package
nhanesA (which I used to download the data). You’ll also
find a summary of key variables from each data set at the end of this
chapter.
Being able to access elements in a data frame is essential when working with data. Here are some common methods to select specific elements, rows, or columns.
# Look at the first and the last few rows
head(demo)
tail(demo)
# Select columns by name
demo[, c("RIDAGEYR", "RIAGENDR")] # Selecting age in years and gender
vars <- c("RIDAGEYR", "RIAGENDR")
demo[, vars] # Alternative using variable `vars`
# Select elements by position
demo[1, 1] # Access the first element of the first column (the respondent sequence number of the 1st participant)
## [1] 62161
ind_mat <- cbind(c(1, 3, 5), c(2, 4, 6))
demo[ind_mat] # Matrix indexing: each row of ind_mat is one (row, column) pair, here (1, 2), (3, 4) and (5, 6)
## [1] "NHANES 2011-2012 public release" "Male"
## [3] NA
# Select rows based on a condition
head(demo[, "RIDAGEYR"] > 50) # Logical condition for age greater than 50
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(!(demo[, "DMDHHSIZ"] > 3)) # Logical negation: total number of people in the household is not greater than 3
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
demo[demo[, "RIDAGEYR"] > 50, ] # Rows where age > 50
demo[demo[, "DMDHHSIZ"] < 3, ] # Rows where the total number of people in the household is less than 3
demo[demo[, "DMDHHSIZ"] >= 3, ] # Rows where the total number of people in the household is greater than or equal to 3
# Combine logical vectors using "&" (AND), "|" (OR), and "!" (NOT)
demo[(demo[, "RIDAGEYR"] > 50 & demo[, "RIAGENDR"] == "Female"), ] # Both conditions must be true
demo[(demo[, "DMDHHSIZ"] < 3 | demo[, "RIAGENDR"] == "Male"), ] # At least one condition must be true
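One detail to keep in mind when selecting rows by condition is how missing values behave. The sketch below uses a made-up toy data frame (`toy`) with the same column names as the demographics table. The values are invented for illustration and do not come from NHANES.

```r
# Toy data frame with made-up values (not real NHANES data)
toy <- data.frame(
  SEQN     = 1:5,
  RIAGENDR = c("Male", "Female", "Female", "Male", "Female"),
  RIDAGEYR = c(22, 67, NA, 58, 45),
  DMDHHSIZ = c(5, 2, 3, 1, 4)
)

# A comparison with a missing value returns NA, not TRUE or FALSE
toy[, "RIDAGEYR"] > 50

# Indexing with that logical vector returns an extra row filled with NA
toy[toy[, "RIDAGEYR"] > 50, ]

# which() keeps only the positions where the condition is TRUE
toy[which(toy[, "RIDAGEYR"] > 50), ]

# subset() also drops rows where the condition evaluates to NA
subset(toy, RIDAGEYR > 50 & RIAGENDR == "Female")
```

Wrapping the condition in which(), or using subset(), is a common way to avoid unexpected NA rows when a column contains missing values.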